FIGURE 5.3 (a) Feed-forward Networks. (b) Scaled Dot-Product Attention. (c) Multi-Head Self-Attention.
second- or higher-dimensional tensors. For all other operations, such as sums, the computational cost added by the quantization operation outweighs the benefit of operating with reduced precision, so they do not quantize such operations. More precisely, all weights of the Transformer are quantized except the biases: since biases are summed with the INT32 output of matrix multiplications, quantizing them provides no additional computational efficiency. Furthermore, the memory footprint of the biases is insignificant compared to the weight matrices, as they represent less than 0.1% of the total weights.
As for the positional embeddings, the authors quantize them once before training, since these embeddings are fixed. The γ weights of the LayerNorms are also quantized.
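To make this selection concrete, the sketch below applies a simulated ("fake") uniform quantizer to the parameters discussed so far: weight matrices, LayerNorm γ weights, and the fixed positional embeddings, while biases are left in full precision. The min–max quantizer, the fake_quantize helper, and the parameter names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulated uniform quantization: map x to num_bits integers, then back to float."""
    qmax = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized values used in the forward pass

# Hypothetical parameters of one Transformer sub-layer (names are illustrative).
params = {
    "attn.W_q": np.random.randn(512, 512),        # weight matrix: quantized
    "attn.bias_q": np.random.randn(512),          # bias: summed with INT32 matmul output
    "layernorm.gamma": np.random.randn(512),      # LayerNorm gamma: quantized
    "pos_embedding": np.random.randn(100, 512),   # fixed, so quantized once before training
}

quantized = {
    name: value if "bias" in name else fake_quantize(value)  # biases stay full precision
    for name, value in params.items()
}
```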
For activations, the authors quantize the sum of the input embeddings with the positional
encodings in both the encoder and decoder. The (Q, K, V) matrices within the multi-head self-attention are quantized, as are the softmax's numerator, denominator, and output, and the scaled dot-product attention's output, as shown
in Fig. 5.3(b) and Fig. 5.3(c). At the inference stage, the full-precision exponential function inside the softmax is replaced with a low-bit version. For the position-wise feed-forward networks, they quantize the output of the ReLUs and the output of the feed-forward networks themselves, as shown in Fig. 5.3(a). Finally, for all LayerNorms, they quantize the numerator x − μ, the denominator √(σ² + ϵ), their quotient, and the output of the LayerNorm.
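The following sketch shows where these quantization operations sit inside the softmax and the LayerNorm. It reuses the same illustrative min–max fake_quantize helper as above; the max-subtraction for numerical stability and the function names are assumptions made for the example, and the low-bit replacement of the exponential used at inference is not shown.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    # Same illustrative min-max uniform fake quantizer as in the previous sketch.
    qmax = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    return np.clip(np.round((x - lo) / scale), 0, qmax) * scale + lo

def quantized_softmax(scores, num_bits=8):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))     # full-precision exp
    numerator = fake_quantize(exp, num_bits)                      # softmax numerator
    denominator = fake_quantize(numerator.sum(axis=-1, keepdims=True), num_bits)
    return fake_quantize(numerator / denominator, num_bits)       # softmax output

def quantized_layernorm(x, gamma, beta, eps=1e-5, num_bits=8):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    numerator = fake_quantize(x - mu, num_bits)                   # x - mu
    denominator = fake_quantize(np.sqrt(var + eps), num_bits)     # sqrt(sigma^2 + eps)
    quotient = fake_quantize(numerator / denominator, num_bits)   # their quotient
    return fake_quantize(gamma * quotient + beta, num_bits)       # LayerNorm output (beta, a bias, is not quantized itself)
```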
5.2.3 Tensor Bucketing
The authors adopt tensor bucketing: instead of using a single set of quantization parameters per quantized tensor, each subset (bucket) of the tensor is quantized with its own set of parameters. Even though this adds more scalars, the overall memory cost is insignificant. Furthermore, the authors argue that the added flexibility can significantly alleviate the precision loss caused by mapping all values to a single low numerical precision domain.
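As a rough illustration of this idea, the sketch below quantizes a weight matrix once with a single set of min–max parameters and once with one set per column (one bucket per output dimension); the min–max quantizer and the column-wise bucketing are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    # Single set of quantization parameters for the whole tensor.
    qmax = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    return np.clip(np.round((x - lo) / scale), 0, qmax) * scale + lo

def fake_quantize_bucketed(W, num_bits=8):
    # One set of quantization parameters per bucket (here: per output column).
    return np.stack([fake_quantize(W[:, j], num_bits) for j in range(W.shape[1])], axis=1)

W = np.random.randn(512, 2048)
err_single = np.abs(W - fake_quantize(W)).mean()           # one range for all values
err_bucket = np.abs(W - fake_quantize_bucketed(W)).mean()  # per-bucket ranges
# err_bucket is typically smaller, since each bucket is mapped to its own range
# instead of the whole tensor sharing a single low-precision domain.
```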
This tensor bucketing method uses a number of subsets equal to the output dimension